Rapid Learning with Stochastic Focus of Attention 



Raphael Pelossof    PELOSSOF@CBIO.MSKCC.ORG
Comp. Bio. Sloan Kettering

Zhiliang Ying    ZYING@STAT.COLUMBIA.EDU
Statistics Department Columbia University



Abstract 

We present a method to stop the evaluation 
of a decision making process when the re- 
sult of the full evaluation is obvious. This 
trait is highly desirable for online margin- 
based machine learning algorithms where a 
classifier traditionally evaluates all the fea- 
tures for every example. We observe that 
some examples are easier to classify than oth- 
ers, a phenomenon which is characterized by 
the event when most of the features agree 
on the class of an example. By stopping 
the feature evaluation when encountering an 
easy to classify example, the learning algo- 
rithm can achieve substantial gains in com- 
putation. Our method provides a natural at- 
tention mechanism for learning algorithms. 
By modifying Pegasos, a margin-based online 
learning algorithm, to include our attentive 
method we lower the number of attributes 
computed from $n$ to an average of $O(\sqrt{n})$
features without loss in prediction accuracy. 
We demonstrate the effectiveness of Atten- 
tive Pegasos on MNIST data. 



1. Introduction 

The running time of margin based online algorithms is 
a function of the number of features, or the dimension- 
ality of the input space. Since models today may have 
thousands of features, running time seems daunting, 
and depending on the task, one may wish to speed- 
up these online algorithms, by pruning uninformative 
examples. We propose to stop the computation of feature
evaluations early for uninformative examples by
connecting Sequential Analysis (Wald, 1945; Lan et al.,
1982) to margin based learning algorithms.

Many decision making algorithms make a decision by
comparing a sum of observations to a threshold. If the 
sum is smaller than a pre-defined threshold a certain 
action is taken, otherwise a different action is taken 
or no action is taken. This type of reasoning is preva- 
lent in the margin-based Machine Learning commu- 
nity, where typically an additive model is compared to 
a threshold, and a subsequent action is taken depend- 
ing on the result of the comparison. Margin-based 
learning algorithms average multiple weak hypotheses 
to form one strong combined hypothesis - the major- 
ity vote. When training, the combined hypothesis is 
usually compared to a threshold to make a decision 
about when to update the algorithm. When testing, 
the combined hypothesis is compared to a threshold 
to make a predictive decision about the class of the 
evaluated example. 

With the rapid growth of the size of data sets, both 
in terms of the number of samples, and in terms of 
the dimensionality, margin based learning algorithms 
can average thousands of hypotheses to create a single 
highly accurate combined hypothesis. Evaluating all 
the hypotheses for each example becomes a daunting 
task since the size of the data set can be very large, in 
terms of number of examples and the dimensionality of 
each example. In terms of number of examples, we can 
speed up processing by filtering out uninformative
examples for the learning process from the data set. The
measure of the importance of an example is typically 
a function of the majority vote. In terms of dimen- 
sionality, we would like to compute the least number 
of dimensions before we decide whether the majority 
vote will end below or above the decision threshold. 
Filtering out uninformative examples and trying to
compute as few hypotheses as possible are closely related
problems (Blum & Langley, 1997). The decision
whether an example is informative or not depends on
the magnitude of the majority rule, that is, the amount
of disagreement between the hypotheses. Therefore,
to find which example to filter, the algorithm needs 
to evaluate its majority vote, and we would like it to 
evaluate the least number of weak hypotheses before 
coming to a decision about the example's importance. 

Majority vote based decision making can be general- 
ized to comparing a weighted sum of random variables 
to a given threshold. If the majority vote falls below 
the pre-specified threshold a decision is made, typi- 
cally the model is updated, otherwise the example is 
ignored. Since the majority vote is a summation of
weighted random variables, or averaged weak hypotheses,
it can be computed sequentially. Sequential Analysis
allows us to develop the statistical tools needed to speed
up this evaluation process when its result is evident.

We use the terms margin and full margin to describe 
the summation of all the feature evaluations, and par- 
tial margin as the summation of a part of the feature 
evaluations. The calculation of the margin is broken 
up for each example in the stream. This break-up 
allows the algorithm to make a decision after the eval- 
uation of each feature whether the next feature should 
also be evaluated or the feature evaluation should be 
stopped, and the example should be rejected for lack 
of importance in training. By making a decision after 
each evaluation we are able to early stop the evaluation 
of features on examples with a large partial margin af- 
ter having evaluated only a few features. Examples 
with a large partial margin are unlikely to have a full 
margin below the required threshold. Therefore, by 
rejecting these examples early, large savings in com- 
putation are achieved. 

This paper proposes several simple novel methods based on Sequential Analysis (Wald, 1945; Lan et al., 1982) and stopping methods for Brownian Motion to drastically improve the computational efficiency of margin based learning algorithms. Our methods accurately stop the evaluation of the margin when the result of the entire summation is evident.

Instead of looking at the traditional classification er- 
ror we look at decision error. Decision errors are er- 
rors that occur when the algorithm rejects an example 
that should be accepted for training. Given a desired 
decision error rate we would like the test to decide 
when to stop the computation. This test is adap- 
tive, and changes according to the partial computa- 
tion of the margin. We demonstrate that this simple 
test can speed up Pegasos by an order of magnitude
while maintaining generalization accuracy. Our novel 
algorithm can be easily parallelized. 



2. Related Work 

Margin based learning has spurred countless algo- 
rithms in many different disciplines and domains. 
Typically a margin based learning algorithm evaluates 
the sign of the margin of each example and performs 
a decision. Our work provides early stopping rules for 
the margin evaluation when the result of the full eval- 
uation is obvious. This approach lowers the average 
number of features evaluated for each example accord- 
ing to its importance. Our stopping thresholds apply 
to the majority of margin based learning algorithms. 

The most directly applicable machine learning algorithms are margin based online learning algorithms. Many margin based online algorithms base their model update on the margin of each example in the stream. Online algorithms such as Kivinen and Warmuth's Exponentiated Gradient (Kivinen & Warmuth, 1997) and Oza and Russell's Online Boosting (Oza & Russell, 2001) update their respective models by using a margin based potential function. Passive online algorithms, such as Rosenblatt's perceptron (Rosenblatt, 1958) and Crammer et al.'s online passive-aggressive algorithms (Crammer et al., 2006), define a margin based filtering criterion for update, which only updates the algorithm's model if the value of the margin falls below a defined threshold. All these algorithms fully evaluate the margin for each example, which means that they evaluate all their features for every example. Recently, Shalev-Shwartz & Srebro (2008) and Shalev-Shwartz et al. (2010) proposed Pegasos, an online stochastic gradient descent based SVM solver, which produces a maximum margin classifier at the end of the training process.

The above mentioned algorithms passively evaluate all the features for each given example in the stream. However, if there is a cost (such as time) assigned to a feature evaluation, we would like to design an efficient learner which actively chooses which features it would like to evaluate. The idea of learning with a feature budget was first introduced to the machine learning community in Ben-David & Dichterman (1998). The authors introduced a formal framework for the analysis of learning algorithms with restrictions on the amount of information they can extract, specifically allowing the learner to access a fixed number of attributes, which is smaller than the entire set of attributes. They presented a framework that is a natural refinement of the PAC learning model; however, traditional PAC characteristics do not hold in this framework. Very recently, both Cesa-Bianchi et al. (2010) and Reyzin (2010) studied how to efficiently learn a linear predictor under a feature budget (see Figure 1). Also, Clarkson et al. (2010) extended the Perceptron algorithm to efficiently learn a classifier in sub-linear time.

[Figure 1: three panels (Perceptron, Budgeted Perceptron, Attentive Perceptron), each showing the feature evaluations for an informative and an uninformative example. Legend: evaluated feature, not-evaluated feature.]

Figure 1. Two examples are classified. The first is hard to classify, the second easy. The budgeted learning approach would evaluate the same number of features for both examples, whereas the stochastic approach would evaluate features according to how hard the example is to classify, while maintaining an average budget.

Similar active learning algorithms were developed in 
the context of when to pay for a label (as opposed 
to an attribute). Such active learning algorithms are 
presented with a set of unlabeled examples and decide
which examples' labels to query at a cost. The
algorithm's task is to pay for as few labels as possible
while achieving specified accuracy and reliability
rates (Dasgupta et al., 2005; Cesa-Bianchi et al., 2006).



Typically, for selective sampling active learning algo- 
rithms the algorithm would ignore examples that are 
easy to classify, and pay for labels for harder to classify 
examples that are close to the decision boundary. 

Our work stems from connecting the underlying ideas 
between these two active learning domains, attribute 
querying and label querying. The main idea is that 
typically an algorithm should not query many at- 
tributes for examples that are easy to classify. The 
labels for such examples, in the label query active 
learning setting, are typically not queried. For such 
examples most of the attributes would agree on the
classification of the example, and therefore the algorithm
need not evaluate many of them before deciding the
importance of such examples.

3. The Sequential Thresholded Sum Test

The novel Constant Sequential Thresholded Sum Test
is a test which is designed to control the rate of decision
errors a margin based learning algorithm makes.
Although the test is known in statistics, it has never
been applied to learning algorithms before.



3.1. Mathematical Roadmap 

Our task is to find a filtering framework that would 
speed-up margin-based learning algorithms by quickly 
rejecting examples of little importance. Quick rejec- 
tion is done by creating a test that stops the margin 
evaluation process given the partial computation of the 
margin. We measure the importance of an example 
by the size of its margin. We define $\theta$ as the importance
threshold, where examples that are important to
learning have a margin smaller than $\theta$. Statistically,
this problem can be generalized to finding a test for 
early stopping the computation of a partial sum of in- 
dependent random variables when the result of the full 
summation is guaranteed with a high probability. 

We look at decision errors of a sum of weighted in- 
dependent random variables. Then given a required 
decision error rate we will derive the Constant Sequen- 
tial Thresholded Sum Test (Constant STST) which will 
provide adaptive early stopping thresholds that main- 
tain the required confidence. 

Let the sum of weighted independent random variables $(w_i, X_i)$, $i = 1, \ldots, n$, be defined by $S_n = \sum_{i=1}^{n} w_i X_i$, where $w_i$ is the weight assigned to the random variable $X_i$. We require that $w_i \in \mathbb{R}$, $X_i \in [-1, 1]$. We define $S_n$ as the full sum, $S_i$ as the partial sum, and $S_{in} = S_n - S_i = \sum_{j=i+1}^{n} w_j X_j$ as the remaining sum. Once we have computed the partial sum up to the $i$th random variable we know its value $S_i$. Let the stopping threshold at coordinate $i$ be defined by $\tau_i$. We use the notation $E S_{in}$ to denote the expected value of the remaining sum.
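As a small illustration of the notation above, the following sketch computes the partial sums $S_i$ and the remaining sum $S_{in}$ for a toy weight vector. The function names `partial_sums` and `remaining_sum` and the numeric values are illustrative assumptions, not part of the original method.

```python
from itertools import accumulate


def partial_sums(weights, xs):
    """Return [S_1, ..., S_n], where S_i = sum_{j <= i} w_j * x_j."""
    return list(accumulate(w * x for w, x in zip(weights, xs)))


def remaining_sum(weights, xs, i):
    """Return S_in = S_n - S_i, the part of the sum not yet evaluated after i terms."""
    sums = partial_sums(weights, xs)
    full = sums[-1]
    partial = sums[i - 1] if i > 0 else 0.0
    return full - partial


# Toy data (hypothetical): weights in R, observations in [-1, 1].
weights = [0.5, -1.0, 2.0, 0.25]
xs = [1.0, -1.0, 0.5, -1.0]

print(partial_sums(weights, xs))      # S_1 .. S_4
print(remaining_sum(weights, xs, 2))  # S_{2,4} = S_4 - S_2
```

A sequential test only ever sees the prefix `partial_sums(...)[:i]`; the remaining sum is the unknown quantity that the stopping boundary has to account for.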

There are four basic events that we are interested in when designing sequential tests which involve controlling decision error rates. Each of these events is important for different applications under different assumptions. The sequential method looks at events that involve the entire random walk, whereas the curtailed method looks at events that accumulate information as the random walk progresses. Let us establish the basic relationship between the sequential method, on the left hand side of the following equation, and the curtailed method on the right hand side:

$$P(\text{stop} \mid S_n < \theta)\, P(S_n < \theta) = P(S_n < \theta \mid \text{stop})\, P(\text{stop}), \tag{1}$$

where stop is the event which occurs when the partial sum crosses a stopping boundary, $\text{stop} = \{S_i > \tau_i\}$.

Previously (?) a Curved STST was proposed by looking at the following "curtailed" conditional probability

$$P(S_n < \theta \mid \text{stop}) = \frac{P(S_n < \theta, \text{stop})}{P(\text{stop})}. \tag{2}$$



The simplicity of deriving the curtailed method stems from the fact that the joint probability and the stopping time probability need not be explicitly calculated to upper bound this conditional. The resulting first stopping boundary, the curved boundary, gives us a constant conditional error probability throughout the curve, which means that it is a rather conservative boundary.

We develop a more aggressive boundary which allows 
higher decision error rates at the beginning of the ran- 
dom walk and lower decision error rates at the end. 
Such a boundary would essentially stop more random
walks early on, and fewer later on. This approach
has the natural interpretation that we want to shorten
the feature evaluation for obviously unimportant samples,
but we want to prolong the evaluations for samples
we are not sure about. A Constant boundary
achieves this exact "error spending" characteristic.

3.2. The Constant Sequential Thresholded Sum Test

We condition the probability of making a decision error in the following way

$$P(\text{stop before } n \mid S_n < \theta) = \frac{P(\text{stop before } n, S_n < \theta)}{P(S_n < \theta)}. \tag{3}$$

We stated in equation (3) a conditional probability function which is conditioned on the examples of interest. Therefore in this case we are interested in limiting the decision error rate for examples that are important. To upper bound this conditional we will make an approximation that will allow us to apply boundary-crossing inequalities for a Brownian bridge. To apply the Brownian bridge to our conditional probability we approximate it by

$$P(\text{stopped before } n \mid S_n < \theta) \tag{4}$$
$$= P(\max_i S_i > \tau \mid S_n < \theta) \tag{5}$$
$$\leq P(\max_i S_i > \tau \mid S_n = \theta). \tag{6}$$

If we assume that the event {Sn < 0} is rare (equiv- 
alently that EXi > 0), then we can approximate the 
inequality with an equality, which gives a Brownian 
bridge. Now we need to calculate boundary crossing 
probabilities of the Brownian bridge and a constant 
threshold. 

Lemma 1. The Brownian bridge Stopping Boundary. Let $T_\tau = \inf\{i : S_i = \tau\}$ be the first hitting time of the random walk and the constant $\tau$. Then the probability of the following decision error is

$$P(T_\tau < n \mid S_n = \theta) = \exp\left(-\frac{2\tau(\tau - \theta)}{\mathrm{var}(S_n)}\right).$$

Proof. See Appendix. □



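Theorem 1. The simplified Constant STST boundary ($\theta = 0$), $\tau = \sqrt{\mathrm{var}(S_n)}\sqrt{\log\frac{1}{\sqrt{\delta}}}$, makes approximately $\delta$ decision mistakes $\{T_\tau < n \mid S_n < 0\}$.

Proof. By approximating $P(S_i > \tau \mid S_n < \theta) \approx P(S_i > \tau \mid S_n = \theta)$ and setting this probability to $\delta$ we get the Constant STST boundary

$$P(S_i > \tau \mid S_n < \theta) \approx \exp\left(-\frac{2\tau(\tau - \theta)}{\mathrm{var}(S_n)}\right) = \delta. \tag{7}$$

Solving,

$$\tau^2 - \tau\theta = \mathrm{var}(S_n)\log\frac{1}{\sqrt{\delta}} \tag{8}$$
$$\left(\tau - \frac{\theta}{2}\right)^2 - \frac{\theta^2}{4} = \mathrm{var}(S_n)\log\frac{1}{\sqrt{\delta}} \tag{9}$$
$$\tau = \frac{\theta}{2} + \sqrt{\frac{\theta^2}{4} + \mathrm{var}(S_n)\log\frac{1}{\sqrt{\delta}}}. \tag{10}$$

If we simplify this boundary by setting $\theta$ to zero, we get the theorem's boundary

$$\tau = \sqrt{\mathrm{var}(S_n)}\sqrt{\log\frac{1}{\sqrt{\delta}}}.$$

□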
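The boundary derived in the proof can be checked numerically. The sketch below computes the boundary $\tau$ for a given variance, error rate $\delta$, and threshold $\theta$, and plugs it back into the Brownian bridge crossing probability of Lemma 1. The function names and the example values are assumptions made for illustration.

```python
import math


def constant_stst_boundary(var_sn, delta, theta=0.0):
    """Boundary tau solving exp(-2 * tau * (tau - theta) / var_sn) = delta for tau > 0."""
    if var_sn <= 0.0 or not 0.0 < delta < 1.0:
        raise ValueError("need var_sn > 0 and 0 < delta < 1")
    log_term = math.log(1.0 / math.sqrt(delta))
    return theta / 2.0 + math.sqrt(theta ** 2 / 4.0 + var_sn * log_term)


def crossing_probability(tau, var_sn, theta=0.0):
    """Brownian bridge crossing probability from Lemma 1."""
    return math.exp(-2.0 * tau * (tau - theta) / var_sn)


# Hypothetical values: var(S_n) = 100, target decision error rate 10%.
tau = constant_stst_boundary(100.0, 0.1)
print(tau, crossing_probability(tau, 100.0))
```

With $\theta = 0$ the function reduces to the simplified boundary of Theorem 1, and the round trip through `crossing_probability` recovers the requested $\delta$ up to floating point error.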






There are two appealing things about this boundary. First, it does not depend on $E S_{in}$ and it is always positive. Second, when using this boundary for prediction, we can directly see the implication on the error rate of the classifier, since the decision error essentially becomes a classification error, a fact that is also clearly evident throughout the experiments.

[Figure 2: (a) decision error rate for the Brownian bridge boundary, plotted against delta; (b) expected and actual stopping times for the Brownian bridge boundary, plotted against the number of features n.]

Figure 2. Performance of the Brownian bridge boundary. (a) A simulation of the Brownian bridge boundary with $X_i \sim \mathcal{N}(0.05, 1)$. The boundary is conservative. (b) The boundary behaves similarly to what is expected from theory. It computes on the order of $O(\sqrt{n})$ features.



3.3. Average Stopping Time for the Curved and Constant STST

We are able to show that the expected number of features evaluated for the Curved and the Constant STST boundaries is on the order of $O(\sqrt{n})$. This can be obtained by limiting the range of values $X_i$ can take.

Theorem 2. Let $|X_i| \leq k$, and let $E X_i > 0$. Let the stopping time of the Brownian bridge be defined by $T = \inf\{i : S_i > \sqrt{\mathrm{var}(S_n)\log\delta^{-0.5}}\}$. Then the expected stopping time is on the order of $O(\sqrt{n})$.



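Proof.

$$E S_T = E S_{T-1} + E X_T \tag{11}$$
$$\leq E S_{T-1} + k \tag{12}$$
$$\leq \sqrt{\mathrm{var}(S_n)\log\delta^{-0.5}} + k. \tag{13}$$

The second inequality holds since the random walk only crossed the boundary for the first time at time $T$ and therefore was under the boundary at time $T - 1$. By applying Wald's Equation $E S_T = ET\, EX$ we get

$$ET \leq \frac{\sqrt{\mathrm{var}(S_n)\log\delta^{-0.5}} + k}{EX} \tag{14}$$
$$\leq \frac{c\sqrt{n} + k}{EX} \tag{15}$$
$$= O(\sqrt{n}), \tag{16}$$

where $c$, $k$, and $EX$ are constants. □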



See Figure 2 for simulation results.
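The square-root scaling can also be observed in a small simulation. The following is a minimal sketch under simplifying assumptions that are not taken from the experiments above: unit weights, i.i.d. steps $X_i \in \{-1, +1\}$ with $P(X_i = +1) = 0.75$, $\delta = 0.1$, and a stopping time capped at $n$. The function names are hypothetical.

```python
import math
import random


def stopping_time(n, p_plus, delta, rng):
    """First i with S_i > sqrt(var(S_n) * log(delta^-0.5)), capped at n."""
    mean_x = 2.0 * p_plus - 1.0
    var_x = 1.0 - mean_x ** 2          # variance of a +/-1 step
    tau = math.sqrt(n * var_x * math.log(delta ** -0.5))
    s = 0
    for i in range(1, n + 1):
        s += 1 if rng.random() < p_plus else -1
        if s > tau:
            return i
    return n


def mean_stopping_time(n, p_plus=0.75, delta=0.1, trials=2000, seed=0):
    """Monte Carlo estimate of E[T] for walks of length n."""
    rng = random.Random(seed)
    total = sum(stopping_time(n, p_plus, delta, rng) for _ in range(trials))
    return total / trials


for n in (400, 1600, 6400):
    print(n, mean_stopping_time(n))
```

Under these assumptions, quadrupling $n$ should roughly double the average stopping time, which is the behavior Theorem 2 predicts.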



4. Attentive Pegasos 



Pegasos (Shalev-Shwartz et al., 2010) is a simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines. To solve the SVM functional it alternates between stochastic gradient descent steps (weight update) and projection steps (weight scaling). Similarly to the Perceptron, these steps are taken when the algorithm makes a classification error while training. For a linear kernel, the total run-time of Pegasos is $O(d/(\lambda\epsilon))$, where $d$ is the number of non-zero features in each example, $\lambda$ is a regularization parameter, and $\epsilon$ is the error rate incurred by the algorithm over the optimal error rate achievable. By assuming that the features are independent, and applying the Brownian bridge STST, we speed up Pegasos to $O(\sqrt{d}/(\lambda\epsilon))$ without losing significant accuracy. Attentive Pegasos is given in Algorithm 1.

4.1. Experiments 

We conducted several experiments to test the speed, 
generalization capabilities, and predictive accuracy of 
the STST. We ran Pegasos, Attentive Pegasos and 
Budgeted Pegasos on 1-vs-1 MNIST digit classification
tasks. We also tested different sampling and ordering 
methods for coordinate selection. 

Attentive Pegasos stops early the computation of examples that are unlikely to have a negative margin. We computed the average number of features the algorithm computed for examples that were filtered, and compared it to a budgeted version of Pegasos, where only a fixed number of features are evaluated for any example. We ran 1-vs-1 digit classification problems under different feature selection policies. With the first policy, we sorted the coordinates so that coordinates with a large absolute weight are evaluated before coordinates with a smaller absolute weight. With the second, the coordinates were selected by sampling from the weight distribution with replacement. Finally, with the third, the coordinates were randomly permuted.

Algorithm 1 Attentive Pegasos

Input: Dataset $\{(x_t, y_t)\}_{t=1}^{m}$, $\lambda$, $\delta$
Initialize: Choose $w_1$ s.t. $\|w_1\| \leq 1/\sqrt{\lambda}$
for $t = 1, 2, \ldots, m$ do
    if $\exists\, i \in \{1, \ldots, n\}$ s.t. $y_t \sum_{j=1}^{i} w_j x_j > 1 + \sqrt{\sum_{j=1}^{n} w_j^2\, \mathrm{var}_{y_t}(x_j)}\, \sqrt{\log\delta^{-0.5}}$ then
        Update $\mathrm{var}_{y_t}(x_j)$, $j = 1, \ldots, i$
        Jump to next example
    else
        Set $\mu_t = 1/(\lambda t)$
        Set $w_{t+1/2} = (1 - \mu_t\lambda)\, w_t + \mu_t y_t x_t$
        Set $w_{t+1} = \min\left\{1, \frac{1/\sqrt{\lambda}}{\|w_{t+1/2}\|}\right\} w_{t+1/2}$
    end if
end for
return $w_{m+1}$
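The early-stopping test at the heart of Algorithm 1 can be sketched as a coordinate-by-coordinate margin evaluation. This is a minimal illustration, not the authors' implementation: the function name `attentive_margin`, the per-feature variance list `variances`, and the exact form of the threshold (importance threshold $\theta = 1$ plus the simplified boundary of Theorem 1) are assumptions.

```python
import math


def attentive_margin(w, x, y, variances, delta, theta=1.0):
    """Evaluate y * <w, x> one coordinate at a time with early stopping.

    Stops as soon as the partial margin exceeds theta + tau, where
    tau = sqrt(var(S_n) * log(delta^-0.5)) and var(S_n) is estimated as
    sum_j w_j^2 * variances[j] (features assumed independent).

    Returns (partial_margin, n_evaluated, stopped_early).
    """
    # Recomputed here for clarity; a real learner would maintain it incrementally.
    var_sn = sum(wj * wj * vj for wj, vj in zip(w, variances))
    tau = math.sqrt(var_sn * math.log(delta ** -0.5))
    partial = 0.0
    for i, (wj, xj) in enumerate(zip(w, x), start=1):
        partial += y * wj * xj
        if partial > theta + tau:
            return partial, i, True    # easy example: reject without an update
    return partial, len(w), False      # full margin computed: candidate for an update


# Hypothetical toy example with 100 features.
w = [1.0] * 100
x = [1.0] * 100
variances = [0.01] * 100
print(attentive_margin(w, x, +1, variances, delta=0.1))
print(attentive_margin(w, x, -1, variances, delta=0.1))
```

When the function reports `stopped_early`, an attentive learner would skip the gradient step for that example; otherwise the full margin is available and the usual Pegasos update rule applies.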

For each one of these scenarios, three algorithms were run 10 times on different permutations of the datasets and their results were averaged (Figures 3 and 4). We first ran Attentive Pegasos under each of the coordinate selection methods. Then we set the budget for Budgeted Pegasos as the average number of features that we got through Attentive Pegasos. Finally, we ran the full Pegasos with a trivial boundary, which essentially computes everything. Both figures show that using the Brownian bridge boundary can save on the order of 10x computation while maintaining similar generalization results to the full computation. Also, sorting under Budgeted Pegasos is impossible since we need to learn the weights in order to sort them. Therefore we did not run Budgeted Pegasos with sorted weights. We can see in the middle subfigure that the Attentive, Budgeted, and Full algorithms maintain almost identical generalization results. However, when we early stop prediction with the resulting model, Attentive Pegasos gives the best predictive results, even better than what we get with the full computation, while computing only about a tenth of the feature values.

4.2. Conclusions 

We sped up online learning algorithms by up to an order of magnitude without any significant loss in predictive accuracy. Surprisingly, in prediction Attentive Pegasos outperformed the original Pegasos, even though it evaluates only a fraction of the features. In addition, we proved that the expected speedup under independence assumptions of the weak hypotheses is $O(\sqrt{n})$, where $n$ is the number of features used by the learner for discrimination.

The thresholding process creates a natural attention 
mechanism for online learning algorithms. Examples 
that are easy to classify (such as background) are fil- 
tered quickly without evaluating many of their fea- 
tures. For examples that are hard to classify, the ma- 
jority of their features are evaluated. By spending lit- 
tle computation on easy examples and a lot of com- 
putation on hard "interesting" examples the Atten- 
tive algorithms exhibit a stochastic focus-of-attention 
mechanism. 

It would be interesting to find an explanation for our 
results where the attentive algorithm outperforms the 
full computation algorithm in the prediction task. 

5. Appendix 

Lemma 2. Brownian bridge boundary crossing probability

$$P(T_\tau < n \mid S_n = \theta) = \exp\left(-\frac{2\tau(\tau - \theta)}{\mathrm{var}(S_n)}\right). \tag{17}$$

Proof. We can look at an infinitesimally small area $d\theta$ around $\theta$:

$$P(T_\tau < n \mid S_n = \theta) = \frac{P(T_\tau < n, S_n \in d\theta)}{P(S_n \in d\theta)}. \tag{18}$$

The numerator can be developed to

$$P(T_\tau < n, S_n \in d\theta) \tag{19}$$
$$= P(T_\tau < n)\, P(S_n \in d\theta \mid T_\tau < n) \tag{20}$$
$$= P(T_\tau < n)\, P(S_n \in 2\tau - d\theta \mid T_\tau < n) \tag{21}$$
$$= P(S_n \in 2\tau - d\theta, T_\tau < n) \tag{22}$$
$$= P(S_n \in 2\tau - d\theta) \tag{23}$$
$$= \frac{1}{\sqrt{\mathrm{var}(S_n)}}\, \phi\left(\frac{2\tau - \theta}{\sqrt{\mathrm{var}(S_n)}}\right) d\theta. \tag{24}$$

Similarly, the denominator is

$$P(S_n \in d\theta) = \frac{1}{\sqrt{\mathrm{var}(S_n)}}\, \phi\left(\frac{\theta}{\sqrt{\mathrm{var}(S_n)}}\right) d\theta.$$

Plugging back into (18),

$$P(T_\tau < n \mid S_n = \theta) \tag{25}$$
$$= \phi\left(\frac{2\tau - \theta}{\sqrt{\mathrm{var}(S_n)}}\right) \Big/ \phi\left(\frac{\theta}{\sqrt{\mathrm{var}(S_n)}}\right) \tag{26}$$
$$= \exp\left[-\frac{1}{2}\left(\frac{(2\tau - \theta)^2}{\mathrm{var}(S_n)} - \frac{\theta^2}{\mathrm{var}(S_n)}\right)\right] \tag{27}$$
$$= \exp\left(-\frac{2\tau(\tau - \theta)}{\mathrm{var}(S_n)}\right). \tag{28}$$

[Figure 3: three bar-chart panels, including "Test error %" and "Prediction error %", each grouped by coordinate selection policy (sort, sample, rand). Legend: bbBoundary, budgetedBoundary, noBoundary.]

Figure 3. Results for Attentive Pegasos, MNIST 2 vs 3, $\delta$ = 10%. Our Brownian bridge decision boundary (blue) processes only 49 features on average (15 times faster than full computation), while achieving similar generalization as the fully trained classifier (red, middle subfigure). On the right subfigure, when the boundary is applied to prediction, Attentive Pegasos achieves a lower error rate than the full computation, and less than half the error of the Budgeted Boundary (green).



Definition 1. A random variable $T$ which is a function of $X_1, X_2, \ldots$ is a stopping time if $T$ has nonnegative integer values and for all $n = 1, 2, \ldots$ there is an event $A_n$ such that $T \leq n$ if and only if $(X_1, \ldots, X_n) \in A_n$, while for $n = 0$, $\{T = 0\}$ is either empty (the usual case) or the whole space. For a nonnegative integer-valued random variable $X$, we have

$$\sum_{j=1}^{\infty} P(X \geq j) = \sum_{j=1}^{\infty} \sum_{k \geq j} P(X = k) \tag{29}$$
$$= \sum_{k=1}^{\infty} k\, P(X = k) = EX \leq +\infty. \tag{30}$$
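The tail-sum identity in equations (29) and (30) is easy to verify on a small distribution. The sketch below uses exact rational arithmetic; the probability mass function `pmf` and the helper names are hypothetical choices made for illustration.

```python
from fractions import Fraction


def expectation(pmf):
    """E[X] = sum_k k * P(X = k) for a finite pmf given as {k: P(X = k)}."""
    return sum(k * p for k, p in pmf.items())


def tail_sum(pmf):
    """sum_{j >= 1} P(X >= j), truncated at the largest value in the support."""
    max_k = max(pmf)
    return sum(
        sum(p for k, p in pmf.items() if k >= j)
        for j in range(1, max_k + 1)
    )


# Hypothetical pmf on {0, 1, 3}.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
print(expectation(pmf), tail_sum(pmf))
```

Both quantities agree exactly, which is the step used at the end of the proof of Wald's identity to turn $\sum_j P(T \geq j)$ into $ET$.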



Lemma 3. ((Wald, 1945), proof from (1)) Wald's Identity. Let $S_T$ be a sum of independent identically distributed random variables $X_1 + \ldots + X_T$ where $E|X_1| < \infty$. Let $S_n = X_1 + \ldots + X_n$, $n \geq 1$, and $S_0 = 0$. Let $T = \inf\{i : S_i = a\}$, $T \geq 0$, where $a$ is a constant, be a random variable with $ET < \infty$. Then $E S_T = ET\, EX$.

Proof. If $T = 0$ the identity is trivial. Otherwise

$$E S_T = \sum_{n=1}^{\infty} P(T = n)\, E(X_1 + \ldots + X_n \mid T = n) \tag{31}$$
$$= \sum_{n=1}^{\infty} P(T = n) \sum_{j=1}^{n} E(X_j \mid T = n) \tag{32}$$
$$= \sum_{j=1}^{\infty} \sum_{n=j}^{\infty} P(T = n)\, E(X_j \mid T = n) \tag{33}$$
$$= \sum_{j=1}^{\infty} \sum_{n=j}^{\infty} E(X_j 1_{T = n}) \tag{34}$$
$$= \sum_{j=1}^{\infty} E(X_j 1_{T \geq j}) \tag{35}$$
$$= \sum_{j=1}^{\infty} E(X_j (1 - 1_{T \leq j-1})). \tag{36}$$

The event $\{T \leq j - 1\}$ is independent of $X_j$, therefore

$$E S_T = \sum_{j=1}^{\infty} E(X_j)\, P(T \geq j) \tag{37}$$
$$= EX \sum_{j=1}^{\infty} P(T \geq j) = EX\, ET \tag{38}$$

by Definition 1. □

Figure 4. Results for Attentive Pegasos, MNIST 3 vs 10, $\delta$ = 10%. Our Brownian bridge decision boundary (blue) processes only 72 features on average, while achieving similar generalization as the fully trained classifier. On the right, when the boundary is applied to classification, Attentive Pegasos gets a lower error rate than the full computation, and over a 2% advantage over the Budgeted Boundary.